
D-Scholarship@Pitt

FigShare

In silico prioritisation of candidate genes for prokaryotic gene function discovery: an application of phylogenetic profiles

Author: C Médigue
C Perez-Iratxeta
CM Fraser
DM Raskin
EA Adie
EA Adie
EC Lin
EM Marcotte
Enrico Coiera
Frank PY Lin
FS Turner
G Michal
IH Witten
J Freudenberg
J Wu
JP Gogarten
JP Vert
KJ Gaulton
M Kanehisa
M Pellegrini
MY Galperin
N López-Bigas
N Tiffin
PD Karp
R Jothi
Ruiting Lan
S Aerts
Vitali Sintchenko
WJ Kent
Y Yamanishi
Y Zheng
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Background: In silico candidate gene prioritisation (CGP) aids the discovery of gene functions by ranking genes according to an objective relevance score. While several CGP methods have been described for identifying human disease genes, corresponding methods for prokaryotic gene function discovery are lacking. Here we present two prokaryotic CGP methods, based on phylogenetic profiles, to assist with this task. Results: Using gene occurrence patterns in sample genomes, we developed two CGP methods (statistical and inductive CGP) to assist with the discovery of bacterial gene functions. Statistical CGP exploits the differences in gene frequency between phenotypic groups, while inductive CGP applies supervised machine learning to identify gene occurrence patterns across genomes. Three rediscovery experiments were designed to evaluate the CGP frameworks. The first experiment attempted to rediscover peptidoglycan genes with 417 published genome sequences. Both CGP methods achieved best areas under the receiver operating characteristic curve (AUC) of 0.911 in the Escherichia coli K-12 (EC-K12) genome and 0.978 in the Streptococcus agalactiae 2603 (SA-2603) genome, with an average improvement in precision of >3.2-fold and a maximum of >27-fold using statistical CGP. In the rediscovery of the peptidoglycan metabolism genes, a median AUC of >0.95 could still be achieved with as few as 10 genome examples in each group. In the second experiment, a maximum 109-fold improvement in precision was achieved in the rediscovery of anaerobic fermentation genes in EC-K12. The last experiment attempted to rediscover genes from 31 metabolic pathways in SA-2603, where 14 pathways achieved AUC >0.9 and 28 pathways achieved AUC >0.8 with the best inductive CGP algorithms. Conclusion: Our results demonstrate that the two CGP methods can assist with the study of functionally uncategorised genomic regions and the discovery of bacterial gene-function relationships. Our rediscovery experiments also provide a set of standard tasks against which future methods may be compared.
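As a rough illustration of the statistical CGP idea described in this abstract, the sketch below scores each gene by how much more often it occurs in genomes that show a phenotype than in genomes that do not, then checks the ranking with an AUC. The abstract does not state the exact statistic used, so the plain frequency difference, the function names, and the toy presence/absence data are all assumptions made for illustration.

```python
def frequency_difference_scores(profiles, in_group):
    """Score genes by occurrence-frequency difference between two genome groups.

    profiles: dict mapping gene name -> list of 0/1 presence flags, one per genome
              (a toy phylogenetic profile; hypothetical data layout).
    in_group: list of bool, True if the genome shows the phenotype of interest.
    """
    n_pos = sum(in_group)
    n_neg = len(in_group) - n_pos
    scores = {}
    for gene, presence in profiles.items():
        pos = sum(p for p, g in zip(presence, in_group) if g)
        neg = sum(p for p, g in zip(presence, in_group) if not g)
        # Assumed statistic: simple difference of within-group frequencies.
        scores[gene] = pos / n_pos - neg / n_neg
    return scores


def auc(scores, known):
    """AUC as the chance that a known gene outscores another gene (ties count 0.5)."""
    pos = [s for g, s in scores.items() if g in known]
    neg = [s for g, s in scores.items() if g not in known]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: six genomes, the first three show the phenotype.
profiles = {
    "geneA": [1, 1, 1, 0, 0, 0],
    "geneB": [1, 1, 0, 0, 0, 1],
    "geneC": [1, 1, 1, 1, 1, 1],
    "geneD": [0, 0, 1, 1, 1, 0],
}
in_group = [True, True, True, False, False, False]

scores = frequency_difference_scores(profiles, in_group)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
print(auc(scores, known={"geneA", "geneB"}))
```

A gene present in every genome (geneC here) scores zero under this statistic, which is the intended behaviour: ubiquitous genes carry no information about the phenotype split.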

Macquarie University ResearchOnline

UNSWorks

BICEPP: an example-based statistical text mining method for predicting the binary characteristics of drugs

Author: A Korhonen
A Koussounadis
C Perez-Iratxeta
C Perez-Iratxeta
CB Giles
D Fourches
DR Swanson
EA Adie
EA Adie
EC Fieller
F Hammann
Frank PY Lin
FS Turner
GR Grimes
Guy Tsafnat
H Gurulingappa
J Freudenberg
JA Hanley
KJ Gaulton
L Màrquez
M Hall
M Krallinger
Matthew P Doogue
MF Porter
N López-Bigas
N Tiffin
P Srinivasan
RJ Epstein
S Aerts
S Raychaudhuri
S Raychaudhuri
S Rossi
S Tatar
S Yu
Stephen Anthony
Thomas M Polasek
TM Polasek
V Sintchenko
Y Garten
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Background: The identification of drug characteristics is a clinically important task, but it requires much expert knowledge and consumes substantial resources. We have developed a statistical text-mining approach (BInary Characteristics Extractor and biomedical Properties Predictor: BICEPP) to help experts screen drugs that may have important clinical characteristics of interest. Results: BICEPP first retrieves MEDLINE abstracts containing drug names, then selects tokens that best predict the list of drugs which represents the characteristic of interest. Machine learning is then used to classify drugs using a document frequency-based measure. Evaluation experiments were performed to validate BICEPP's performance on 484 characteristics of 857 drugs, identified from the Australian Medicines Handbook (AMH) and the PharmacoKinetic Interaction Screening (PKIS) database. Stratified cross-validations revealed that BICEPP was able to classify drugs into all 20 major therapeutic classes (100%) and 157 (of 197) minor drug classes (80%) with areas under the receiver operating characteristic curve (AUC) > 0.80. Similarly, AUC > 0.80 could be obtained in the classification of 173 (of 238) adverse events (73%), up to 12 (of 15) groups of clinically significant cytochrome P450 enzyme (CYP) inducers or inhibitors (80%), and up to 11 (of 14) groups of narrow therapeutic index drugs (79%). Interestingly, it was observed that the keywords used to describe a drug characteristic were not necessarily the most predictive ones for the classification task. Conclusions: BICEPP has sufficient classification power to automatically distinguish a wide range of clinical properties of drugs. This may be used in pharmacovigilance applications to assist with rapid screening of large drug databases to identify important characteristics for further evaluation.

Macquarie University ResearchOnline

ProphNet: A generic prioritization method through propagation of information

Author: A Hamosh
A Molven
A Naderi
A Rökman
AL Barabási
AL Gloyn
AP Babenko
Armando Blanco
B Raghavachari
C Fenoglio
C Van Duijn
Carlos Cano
D Zhou
E Jain
EA Adie
EA Adie
EC van Hove
G Chenevix-Trench
GS Wilkie
JBJ Kwok
KJ Gaulton
LE Wold
MA van Driel
N Aziz
N Rahman
O Vanunu
O Vanunu
P Vahteristo
PJ Westenend
RD Finn
S Aerts
S Köhler
S Navlakha
S Peri
SK Ng
T Buterin
T Hwang
T Walsh
V Martínez
Víctor Martínez
W Wang
W Zhang
X Wang
X Wu
Y Li
Y Moreau
Y Wang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2014
Field of study

This article has been published as part of BMC Bioinformatics Volume 15 Supplement 1, 2014: Integrated Bio-Search: Selected Works from the 12th International Workshop on Network Tools and Applications in Biology (NETTAB 2012). [Background] Prioritization methods have become a useful tool for mining large amounts of data to suggest promising hypotheses in early research stages. In particular, network-based prioritization tools use a network representation of the interactions between different biological entities to identify novel indirect relationships. However, current network-based prioritization tools are strongly tailored to specific domains of interest (e.g. gene-disease prioritization) and cannot handle networks with more than two types of entities (e.g. genes and diseases). Therefore, the direct application of these methods to new prioritization tasks is limited. [Results] This work presents ProphNet, a generic network-based prioritization tool that integrates an arbitrary number of interrelated biological entities to accomplish any prioritization task. We tested the performance of ProphNet in comparison with leading network-based prioritization methods, namely rcNet and DomainRBF, for gene-disease and domain-disease prioritization, respectively. The results obtained by ProphNet show a significant improvement in terms of sensitivity and specificity for both tasks. We also applied ProphNet to disease-gene prioritization on Alzheimer, Diabetes Mellitus Type 2 and Breast Cancer to validate the results and identify putative candidate genes involved in these diseases. [Conclusions] ProphNet works on top of any heterogeneous network by integrating information on different types of biological entities to rank entities of a specific type according to their degree of relationship with a query set of entities of another type. Our method works by propagating information across data networks and measuring the correlation between the propagated values for a query set and a target set of entities. ProphNet is available at: http://genome2.ugr.es/prophnet. A Matlab implementation of the algorithm is also available at the website. This work was part of projects P08-TIC-4299 of J. A., Sevilla and TIN2009-13489 of DGICT, Madrid. It was also supported by Plan Propio de Investigación, University of Granada.
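To make the notion of "propagating information across a network" concrete, here is a minimal sketch of a generic label-propagation scheme on a single small graph. It is not ProphNet's actual algorithm, which the abstract does not specify in detail: the update rule f = alpha * W f + (1 - alpha) * y, the symmetric normalization, the parameter values, and the toy graph are all assumptions chosen for illustration.

```python
def normalize(adj):
    """Symmetric normalization: W[i][j] = A[i][j] / sqrt(deg_i * deg_j)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [
        [adj[i][j] / (deg[i] * deg[j]) ** 0.5 if adj[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


def propagate(adj, seeds, alpha=0.5, iters=100):
    """Spread seed scores over the graph by iterating f = alpha*W*f + (1-alpha)*y.

    adj:   square adjacency matrix as a list of lists (0/1 or weights).
    seeds: set of node indices forming the query set (initial score 1.0).
    alpha: assumed mixing weight between neighbour scores and the seed vector.
    """
    w = normalize(adj)
    n = len(adj)
    y = [1.0 if i in seeds else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [
            alpha * sum(w[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
    return f


# Toy path graph 0 - 1 - 2 - 3 with node 0 as the only query node.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
f = propagate(adj, seeds={0})
print(f)
```

In this toy case the propagated score decays with distance from the query node, so nodes can be ranked by their proximity to the query set in the network.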

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Granada

Integration of multiple data sources to prioritize candidate genes using discounted rating system

Author: A Hamosh
C Perez-Iratxeta
C Stark
D Botstein
D Lin
EA Adie
F Turner
J Xu
Jagdish C Patra
JJ Jiang
JM Stuart
K Järvelin
L Lovasz
LC Tranchevent
M Mistry
MA Harris
MG Anne
N López-Bigas
P Resnik
S Aerts
S Kohler
S Peri
T De Bie
Y Li
Yongjin Li
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Background: Identifying disease genes from a list of candidate genes is an important task in bioinformatics. The main strategy is to prioritize candidate genes based on their similarity to known disease genes. Most existing gene prioritization methods access only one genomic data source, which is noisy and incomplete. Thus, there is a need to integrate multiple data sources containing different information. Results: In this paper, we propose a combination strategy called the discounted rating system (DRS). We performed leave-one-out cross-validation to compare it with the N-dimensional order statistics (NDOS) used in Endeavour. The results showed that the AUC (Area Under the Curve) values achieved by DRS were comparable with those of NDOS on most of the disease families, but DRS ran much faster than NDOS, especially as the number of data sources increased. With 100 candidate genes and 20 data sources, DRS runs more than 180 times faster than NDOS. Within the DRS framework, different data sources can be given different weights. The weighted DRS achieved significantly higher AUC values than NDOS. Conclusion: The proposed DRS algorithm is a powerful and effective framework for candidate gene prioritization. If the weights of the different data sources are chosen properly, the DRS algorithm will perform better.

DR-NTU (Digital Repository of NTU)

Swinburne Research Bank

Expression pattern of drought stress marker genes in soybean roots under two water deficit systems

arXiv.org e-Print Archive

ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

Author: A Su
B Brancotte
B Calvo
B Linghu
B Liu
B Schölkopf
B Schölkopf
B Schölkopf
C Giallourakis
C Perez-Iratxeta
C Son
CC Chang
EA Adie
F Denis
F Mordelet
Fantine Mordelet
FS Turner
G Lanckriet
GRG Lanckriet
J Freudenberg
Jean-Philippe Vert
K Bleakley
K Lage
L Jacob
L Jacob
LC Tranchevent
M van Driel
N López-Bigas
N Tiffin
O Vanunu
P Pavlidis
RI Kondor
S Aerts
S Köhler
S Yu
T De Bie
T Evgeniou
T Hwang
U Ala
V McKusick
X Wu
Y Yamanishi
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases. Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.

Systematic analysis, comparison, and integration of disease based human genetic association data and mouse genetic phenotypic information

Author: A Subramanian
AG Heidema
AJ Butte
BK Lin
BT Sherman
DI Chasman
DM Evans
EA Adie
F Bresso
H Mei
HS Chai
J Ward
JH Choi
JM Hancock
John R Garner
JP Ioannidis
Kevin G Becker
KG Becker
KG Becker
KG Becker
KI Goh
Kirstin Smith
M Holden
M Liu
M Slatkin
M Yi
MA Cheh
MB Eisen
MJ Khoury
N Gharani
NR Wray
P Yue
RM Plenge
S Alex Wang
S Ray
SE Harris
SL Zheng
Supriyo De
SY Kim
V Emilsson
VA McKusick
W Huang da
WM Fitch
X Wang
X Wu
Y Guan
YH Lee
Yonqing Zhang
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

A literature-based similarity metric for biological processes

Author: A Hyvarinen
A Tanay
AA Petti
AB Maxfield
AG Fraser
AH Tong
Alberto Pascual-Montano
CD Powell
Concha Gil
D Chaussabel
D Lin
D Martin
DD Lee
DE Levin
DM Blei
E Ravasz
EA Adie
G Weeks
H Shatkay
HS Carr
J Tuikkala
Jose M Carazo
L Giot
LH Hartwell
M Ashburner
M Chagoyen
M Vidal
MF Porter
Monica Chagoyen
NJ Krogan
O Bodenreider
P Glenisson
P Khatri
P Pehkonen
P Resnik
P Resnik
Pedro Carmona-Saez
PV Ogren
PW Lord
PW Lord
R Homayouni
RB Cattell
S Deerwester
S Deerwester
S Myhre
T Hofmann
T Sekito
T Yu
U Alon
VL Boyartchuk
X Wu
Z Bar-Joseph
ZN Oltvai
Publication venue: BioMed Central
Publication date: 01/07/2006
Field of study

BACKGROUND: Recent analyses in systems biology pursue the discovery of functional modules within the cell. Recognition of such modules requires the integrative analysis of genome-wide experimental data together with available functional schemes. Methods are therefore required to bridge the gap between the abstract definitions of cellular processes in current schemes and the interlinked nature of biological networks. RESULTS: This work explores the use of the scientific literature to establish potential relationships among cellular processes. To this end we have used a document-based similarity method to compute pair-wise similarities of the biological processes described in the Gene Ontology (GO). The method has been applied to the biological processes annotated for the Saccharomyces cerevisiae genome. We compared our results with similarities obtained with two ontology-based metrics, as well as with gene product annotation relationships. We show that the literature-based metric conserves most direct ontological relationships, while revealing biologically sound similarities that are not obtained using ontology-based metrics and/or genome annotation. CONCLUSION: The scientific literature is a valuable source of information from which to compute similarities among biological processes. The associations discovered by literature analysis are a valuable complement to those encoded in existing functional schemes, and those that arise from genome annotation. These similarities can be used to conveniently map the interlinked structure of cellular processes in a particular organism.

Digital.CSIC

MOBAS: identification of disease-associated protein subnetworks using modularity-based scoring

Author: A Agresti
A Clauset
A Strange
A Torkamani
C Brorsson
C Grunfeld
C Perez-Iratxeta
CJ Gallagher
D Altshuler
D Delling
D Maglott
EA Adie
EA Adie
F Vandin
FS Turner
GC Linderman
H Ma
J-Y Deng
JE Lim
JL Fleiss
K Wang
M Girvan
M Van Driel
Mark R. Chance
Marzieh Ayati
ME Newman
Mehmet Koyutürk
N López-Bigas
N Tiffin
N Tiffin
O Vanunu
P Jia
R Yamada
RP Nair
S Erten
S Fortunato
S Maslov
S Purcell
SE Baranzini
Sinan Erten
T Ideker
TJ Russell
WTCC Consortium
Y Zhang
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study